Abstract

Variable selection is recognized as one of the most critical steps in statistical modeling.
The problems encountered in engineering and social sciences are commonly characterized by
an over-abundance of explanatory variables, non-linearities and unknown interdependencies
between the regressors. An added difficulty is that the analysts may have little or no prior
knowledge on the relative importance of the variables. To provide a robust method for model
selection, this paper introduces a technique called the Multi-objective Genetic Algorithm for
Variable Selection (MOGA-VS), which provides the user with an efficient set of regression
models for a given data-set. The algorithm considers the regression problem as a two
objective task, where models with fewer regression coefficients and better goodness of fit
are preferred over the others. In MOGA-VS, the model selection procedure is implemented in
two steps. First, we generate the frontier of all efficient or non-dominated regression models
by eliminating the inefficient or dominated models without any user intervention. Second, a
decision making process is executed which allows the user to choose the most preferred model
using visualizations and simple metrics.

1 Introduction 

Model selection is a ubiquitous task in many branches of science. Investigators are often
interested in finding the best predictors for the dependent variable, those which lead to a
good quality of fit and parsimony. A compromise has to be made between fitness and parsimony,
as the inclusion of too many predictors leads to a loss in precision of the regression
coefficients, while omitting important factors leads to a mis-estimation of the regression
coefficients and biased prediction (Murtaugh (1998)). This trade-off makes the model selection
task a two objective problem. However, most of the existing approaches have handled the model
selection task as a single objective problem by using various penalized model selection
criteria (such as AIC and BIC); see e.g. Jeffreys (1961); Miller (2002); Burnham and Anderson
(2004); MacKay (2003); Gregory (2005); Zhu and Chipman (2006) and references therein. Despite
a lot of work in the direction of model selection techniques, there is no single method which
can be utilized for all problems. This is explained by the fact that the model selection task
is inherently not a single objective problem with a uniquely defined solution. Instead, each
selection criterion or single objective method is bound to produce different results, because
they work by giving higher or lower importance to either fitness or parsimony.

In this paper, we propose a multi-objective genetic algorithm for variable selection
(MOGA-VS), which draws insights from the advances in the field of evolutionary computation
(Deb (2001); Coello et al. (2002)). In MOGA-VS, the model selection task is considered
as a multi-objective optimization problem, where the first objective is to reduce the
complexity of the model (or reduce the number of coefficients) and the second objective is to
maximize the goodness-of-fit (or minimize mean squared error). By doing so, the suggested
approach differs from the existing methods in two important ways. First, instead of
attempting to arrive at a single model candidate, the method produces a collection of efficient¹
regression models from which the most preferred model can be chosen. Hence, an essential
benefit in MOGA-VS is its inbuilt ability to handle model uncertainty. The second difference
follows from the separation of the optimization process from choosing a particular trade-off
between goodness-of-fit and model parsimony. The problem of finding all optimal trade-offs
is solved without any user intervention, whereas the task of selecting an optimal balance
between the two objectives is best left as a user's preference-based decision. In MOGA-VS,
the decision making process is guided by using a combination of visual tools and metrics.

In practice, model selection often involves considerable trial and error by the user, where
various model specifications are examined before arriving at a satisfactory candidate. Most
of the time, the user ends up with more than one acceptable model and then resorts to a
model selection criterion to choose the better one. The fact that different model selection
criteria produce different results is a problem which is difficult to avoid, but another
significant problem is that the user might end up comparing models which are actually
dominated², or in other words, worse in terms of complexity as well as fit. The MOGA-VS
algorithm solves the problem of ending up with a dominated model entirely by ensuring that
the models being compared are efficient. This means that, for each of these models, there is
no other model which is less complex and can provide a better fit. To evaluate the performance
of our approach, the method is tested on both simulated and real datasets. It is shown that it
is not wise to resort to one particular model selection criterion while selecting a model, as
the various penalized model selection schemes act as different value functions in a
multi-objective domain and each picks out only one of the models from the efficient set.
MOGA-VS, on the other hand, provides the entire efficient set of models to the user, so that an
assessment can be done and the most preferred model can be selected. Another advantage of this
procedure is that the set of efficient models is obtained by the algorithm without any prior
information about the data-set and with minimal user intervention.



¹ The notion of model efficiency is used synonymously with Pareto-optimality. Further discussion on
optimality in multi-objective problems is provided in Section 2.

² For a bi-objective case, where the objectives are complexity minimization and fitness error minimization,
a model is said to be dominated by another model if it is worse in terms of complexity as well as fit.



The rest of this paper is organized as follows. Section 2 provides a summary of the
central definitions and an overview of the model selection problem within the multi-objective
framework. Section 3 gives a literature review on commonly applied model selection methods
and discusses their differences to the multi-objective optimization framework. The proposed
MOGA-VS algorithm is presented in Section 4. A brief description of the underpinnings of
genetic algorithms is also provided. Section 5 presents the results from experiments with two
different datasets. One of the datasets is a recently published real dataset on Communities
and Crime within the United States. Comparisons with respect to well known variable selection
techniques are included in the study.

2 Model Selection as a Multi-objective Problem 

We begin with a quick review on the model selection task in order to introduce the main 
concepts and the notation used in multi-objective problems. We also present a summary of 
the stages included in choosing an optimal model under the multi-objective paradigm. 

2.1 Trade-off between approximation and estimation error 

The regression modeling task can be viewed as a special example of supervised learning. Let
$\mathcal{Y}$ be the output space and let $\mathcal{X} = \prod_{i=1}^{p} \mathcal{X}_i$ denote the input space, where
$\mathcal{X}_i$ is the domain of the $i$-th explanatory variable and $p$ is the total number of variables.
Given a collection of data $(X_i, Y_i) \in \mathcal{X} \times \mathcal{Y}$, $i = 1, \dots, n$, with an unknown probability
distribution $\mathcal{D}$, the purpose is to find a predictor $f : \mathcal{X} \to \mathcal{Y}$ with minimal error on the
training set with respect to $\mathcal{D}$. To restrict the search space, the predictor is assumed to
belong to a pre-defined hypothesis space $\mathcal{H}$. For linear regression, the hypothesis space can
be written as the set of all linear functions that can be formed using some subset of the
variables contained in the input space,

$$\mathcal{H} = \Big\{ x \mapsto \sum_{k \in J} \beta_k x_k \;\Big|\; J \subseteq \{1, \dots, p\},\ x_k \in \mathcal{X}_k \Big\}.$$

The model selection problem follows from the fact that the hypothesis space consists of
models with varying complexity. In the case of regression modeling, the hypothesis space $\mathcal{H}$
forms a nested structure $\mathcal{H}_1 \subset \mathcal{H}_2 \subset \cdots \subset \mathcal{H}_d \subset \cdots \subset \mathcal{H}$, where $\mathcal{H}_d$
represents the subset of models with at most $d$ variables. This means that in order to find a
preferred predictor, we need to choose the size of the hypothesis space that provides a good
balance between the approximation error (the error caused by restricting $\mathcal{H}$) and the
estimation error (the error caused by learning the predictor from a finite training sample).
As discussed by Ando et al. (2005), among others, it is well known that, for a fixed sample
size, a smaller hypothesis space helps to reduce the estimation error while the accuracy of
the predictor suffers. Hence solving the model selection task is equivalent to considering a
multi-objective optimization problem with two conflicting objectives.



2.2 Multi-objective formulation and optimality 

A multi-objective optimization problem has two or more objectives which are conflicting. The 
objectives are supposed to be simultaneously optimized subject to a given set of constraints. 
These problems are commonly found in the fields of science, engineering, economics or any
other field where optimal decisions are to be taken in the presence of trade-offs between two
or more conflicting objectives.

By interpreting the model selection problem as finding a trade-off between approximation
error and estimation error, we can formulate the following two objective problem where the 
two types of error are jointly minimized. 

Definition 2.2.1 (Multi-objective problem) Let $\varphi : \mathcal{H} \to \mathbb{N} \times \mathbb{R}$, $\varphi = (\varphi_1, \varphi_2)$, denote
an objective vector, where

(i) the first objective $\varphi_1 : \mathcal{H} \to \mathbb{N}$, $\varphi_1(f) = \min\{d \in \mathbb{N} : f \in \mathcal{H}_d\}$, represents the
complexity of a model in terms of the number of variables in the predictor; and

(ii) the second objective $\varphi_2 : \mathcal{H} \to \mathbb{R}$ is the empirical risk
$\varphi_2(f) = \frac{1}{n} \sum_{i=1}^{n} L(f(X_i), Y_i)$ with quadratic loss function
$L(f(X_i), Y_i) = (Y_i - f(X_i))^2$. This is the same as the mean squared error. Some other
suitable objective function may also be considered instead of the mean squared error.

Then the optimization problem is given by

$$\text{minimize } \varphi(f) = (\varphi_1(f), \varphi_2(f)), \quad \text{subject to } f \in C,$$

where $C \subset \mathcal{H}$ is a constraint set.

Usually, multi-objective problems do not have a single optimal solution which simultaneously
maximizes or minimizes all of the objectives together; instead there is a set of solutions
which are optimal in the sense that they are not dominated by any other solution. Once
the models with the best fit corresponding to different complexities are available, the user
can make the choice of the most preferred model.

Definition 2.2.2 (Dominance) A predictor $f^{(1)}$ is said to dominate the other predictor
$f^{(2)}$, denoted as $f^{(1)} \succ f^{(2)}$, if both conditions 1 and 2 are true:

1. The predictor $f^{(1)}$ is no worse than $f^{(2)}$ in both objectives, or $\varphi_j(f^{(1)}) \le \varphi_j(f^{(2)})$,
$j = 1, 2$.

2. The predictor $f^{(1)}$ is strictly better than $f^{(2)}$ in at least one objective, or
$\varphi_j(f^{(1)}) < \varphi_j(f^{(2)})$ for at least one $j \in \{1, 2\}$.
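The two conditions of Definition 2.2.2 translate directly into a short check. The following is a minimal sketch, assuming each predictor is summarized by its objective vector (number of variables, mean squared error); the function name `dominates` and the numeric values are hypothetical and only serve as illustration.

```python
def dominates(a, b):
    """Return True if objective vector `a` dominates `b` (both objectives minimized).

    `a` and `b` are (complexity, error) pairs, i.e. (phi_1(f), phi_2(f)).
    """
    no_worse = all(x <= y for x, y in zip(a, b))        # condition 1
    strictly_better = any(x < y for x, y in zip(a, b))  # condition 2
    return no_worse and strictly_better


# Hypothetical (number of variables, mean squared error) pairs.
print(dominates((2, 0.40), (3, 0.55)))  # fewer variables and lower error: True
print(dominates((2, 0.40), (3, 0.30)))  # a trade-off, neither dominates: False
print(dominates((2, 0.40), (2, 0.40)))  # identical points do not dominate: False
```

Note that dominance is only a partial order: for the second pair neither model dominates the other, which is exactly the situation in which a user preference is needed.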

The idea is illustrated in Figure 1 for a two objective minimization case. Let us consider the
point A as a reference point. The dominance relationships of the different regions with respect
to A are marked by different shadings. The shaded region in the south-west corner marks
the area which dominates point A. The area with a lighter shading in the north-east corner
is the region dominated by the reference point A. The remaining unshaded area represents
the non-dominated region.

Figure 1: Explanation of the domination concept for a minimization problem, where
the reference point A dominates B.

Figure 2: Explanation of the concept of a non-dominated set and a Pareto-optimal
front.

The concept of dominance gives a natural interpretation for optimality in multi-objective 
problems, because the quality of any two solutions can be compared on the basis of whether 
one point (predictor) dominates the other point (predictor) or not. 

Definition 2.2.3 (Non-dominated set and Pareto-optimality) Among a set of solutions
$\mathcal{P} \subset \mathcal{H}$, the non-dominated set of solutions $\mathcal{P}^*$ are those that are not dominated by any
member of the set $\mathcal{P}$, i.e.

$$\mathcal{P}^* = \{ f \in \mathcal{P} \mid \nexists\, g \in \mathcal{P} : g \succ f \}.$$

When the set $\mathcal{P}$ is the entire search space, i.e. $\mathcal{P} = \mathcal{H}$, the resulting non-dominated collection
of predictors $\mathcal{P}^*$ is called the Pareto-optimal set $\mathcal{H}^*$.
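For a finite set of candidates, the non-dominated set of Definition 2.2.3 can be obtained by pairwise comparison. The sketch below assumes that each candidate is represented by a hypothetical (number of variables, mean squared error) pair; the helper names are illustrative and not part of the original method description.

```python
def dominates(a, b):
    """True if `a` is no worse than `b` everywhere and strictly better somewhere."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def non_dominated(points):
    """Return the members of `points` that no other member dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical objective vectors (phi_1, phi_2) of six candidate models.
candidates = [(1, 0.90), (2, 0.50), (2, 0.70), (3, 0.55), (4, 0.20), (5, 0.20)]
front = non_dominated(candidates)
print(front)  # [(1, 0.9), (2, 0.5), (4, 0.2)]
```

The result is non-dominated only relative to the six candidates supplied. It coincides with the Pareto-optimal set only if the candidates cover the entire search space.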

To visualize the idea of Pareto-optimality, Figure 2 shows an example of a minimization
problem with two objective functions. The shaded region in the figure represents the image
of the feasible region in the search space, i.e. $\varphi(\mathcal{H}) = \{\varphi(f) : f \in \mathcal{H}\}$. The bold curve
marks the image of the Pareto-optimal set, $\varphi(\mathcal{H}^*)$, which represents all the optimal points
in the two objective minimization problem. To understand the difference between
Pareto-optimality and non-dominance, the figure also shows a set of points corresponding to
the objective function values of a finite collection of other solutions. Let us denote this
group by $\varphi(\mathcal{P})$. Among these points, the ones connected by the broken line are the values of
solutions in $\mathcal{P}^*$, which are not dominated by any point in the given finite set displayed in
the figure. Hence, although none of these points are Pareto-optimal (because
$\mathcal{P}^* \cap \mathcal{H}^* = \emptyset$), they still constitute a non-dominated set with respect to the finite set
$\mathcal{P}$. The other points which do not belong to the non-dominated set are dominated by at least
one of the points in the non-dominated set. Therefore, the difference between an arbitrary
non-dominated set, such as $\mathcal{P}^*$, and the Pareto-optimal set $\mathcal{H}^*$ is that, in order for a
solution to be considered Pareto-optimal, it must be globally non-dominated in the entire
search space.

2.3 Choosing an optimal model in multi-objective framework 

Having discussed the meaning of optimality in the context of multi-objective optimization,
it is clear that solving a multi-objective problem is fundamentally different from classical
single-objective optimization. The commonly applied model selection techniques attempt to
arrive at a single optimal model, whereas solving the multi-objective Problem 2.2.1 leads to a
collection of predictors which are globally non-dominated. That is, each of them represents an
optimal trade-off between model parsimony and goodness-of-fit. What remains is a
decision-making problem. Out of the multiple optimal solutions, the user should select the most
preferred model according to their own view on the desirable trade-off. Consequently, a
multi-objective approach to optimal model selection can be naturally represented in the
following stages:

Stage 1: Finding the Pareto-optimal models $\mathcal{H}^*$. Solving Problem 2.2.1 is a computationally
demanding task. However, the recently developed techniques for evolutionary
computation can handle the optimization problem in an efficient manner. A detailed
description of the approach suggested in this paper is provided in Section 4.

Stage 2: Choosing an optimal model with preferred trade-off. Once the collection of
Pareto-optimal models is known, it can be graphically analysed to get a better understanding
of the trade-off between $\varphi_1$ and $\varphi_2$. The proposed techniques for analysing both the
objective space $\varphi(\mathcal{H}^*)$ and the predictor set $\mathcal{H}^*$ are discussed in Section 4.2. Based on
this understanding, the user can choose one or more Pareto-optimal models for further
evaluation. The selection rules are discussed in Section 4.3.

The multi-objective framework is focused on finding the Pareto-optimal solutions and
leaves the final model choice as a preference-based decision-making problem. On the other
hand, the classical methods, instead of revealing the Pareto-optimal frontier, impose an a
priori assumption on the way in which the conflicting objectives should be balanced (e.g.
by using a penalizing scheme). By doing so, the classical approaches effectively reduce the
multi-objective optimization problem into a single objective optimization problem that yields
only one model candidate as a solution. To discuss their differences from the multi-objective
framework, an overview of commonly applied model selection techniques is given below in
Section 3.

3 Review on Model Selection Methods 

A number of model selection criteria and methods have been suggested in the recent literature
on statistical modeling and machine learning. However, given the lack of any clear standard,
none of the methods has become dominant, and this leaves the user puzzled as to which
approach to use. Most of the time, each of these selection criteria or methods leads to a
different solution, which makes it difficult for the user to pick a model. Below we
discuss the various approaches and their relationship to the multi-objective framework. The
methods have been roughly categorized into four groups: (1) we begin the overview with a
discussion on the various penalized model selection criteria available in the literature, which
help a user to compare a set of regression models; (2) next, a review of the various stepwise
model selection schemes is provided; (3) in the third subsection, we discuss some of
the heuristic approaches which are currently used for regression model selection; (4) the last
group covers a few recently introduced Bayesian variable selection techniques. The section
concludes with a summary of the central differences between the classical methods and the
multi-objective framework suggested in this paper.



3.1 Selection by Complexity Regularization 

A number of penalized model selection criteria exist in the literature, but two of them are
very commonly used. One is an information-theoretic method pioneered by Akaike (1974), known
as the Akaike Information Criterion (AIC), and the other one uses the Bayesian evidence and is
known as the Bayesian Information Criterion (BIC) (Schwarz (1978)). A model which gives
the least value for the criterion is the most preferred one. There are many other information
criteria which are less commonly used and have been derived using similar principles as AIC
and BIC. They are the Deviance Information Criterion (DIC) (Spiegelhalter et al. (2002)), the
Expected Akaike Information Criterion (EAIC), the Fisher Information Criterion (FIC) (Wei (1992)),
the Generalized Information Criterion (GIC) (Nishii (1984)), the Network Information Criterion
(NIC) (Murata et al. (1991)) and the Takeuchi Information Criterion (TIC) (Takeuchi (1976)).

The classical information criteria can be essentially viewed as various forms of complexity 
regularization scheme, where the purpose is to penalize complex models based on their 
information content or using prior knowledge. In general, the choice of model by complexity 
regularization can be understood as solving a single objective minimization problem. 



$$\hat{f}_n = \operatorname*{arg\,min}_{f \in \mathcal{H}} J(f), \qquad J(f) := R_n(f) + \lambda C(f) \qquad (3)$$



where $R_n : \mathcal{H} \to \mathbb{R}$ denotes the empirical risk (e.g. the function $\varphi_2$ in Problem 2.2.1), and
$C : \mathcal{H} \to \mathbb{R}$ represents the cost of the model, which is commonly expressed in terms of the model
size and sample size. For example, the use of Akaike's Information Criterion corresponds to
minimizing a single objective function, where $R_n(f) = \log(\varphi_2(f))$ and $C(f) = 2\varphi_1(f)/n$
with $\lambda = 1$ (this is the Gaussian-likelihood form of AIC divided by $n$, up to an additive
constant). Hence the choice of penalty scheme (or information criterion) is equivalent to
solving a multi-objective optimization problem where the preferred trade-off between the
objectives $\varphi_1$ and $\varphi_2$ is given a priori. The obtained solution to (3) corresponds to a single
point from the Pareto-optimal frontier $\mathcal{H}^*$.
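To illustrate how a penalty scheme picks one point from the frontier, the following minimal sketch scalarizes a hypothetical Pareto front with two different penalty weights. It assumes the Gaussian-likelihood forms of the criteria divided by the sample size, log(MSE) + 2k/n for AIC and log(MSE) + k log(n)/n for BIC; the front, the sample size and the function name `penalized_choice` are assumptions made purely for illustration.

```python
import math


def penalized_choice(front, n, penalty):
    """Pick the point minimizing log(mse) + penalty * k / n over a Pareto front."""
    return min(front, key=lambda p: math.log(p[1]) + penalty * p[0] / n)


# Hypothetical Pareto front: (number of variables k, mean squared error).
front = [(1, 0.90), (2, 0.50), (3, 0.42), (4, 0.40), (5, 0.395)]
n = 50  # assumed sample size

aic_pick = penalized_choice(front, n, penalty=2.0)          # AIC-type weight
bic_pick = penalized_choice(front, n, penalty=math.log(n))  # BIC-type weight
print(aic_pick, bic_pick)  # (4, 0.4) (3, 0.42)
```

Both criteria return a member of the same efficient set, but different ones: the heavier BIC-type penalty settles on a smaller model. The penalty weight therefore acts as an a priori preference between the two objectives.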

In practice, however, the users of information criteria rarely attempt to solve problem (3)
in a rigorous manner. Instead, they end up comparing some subset of models $\mathcal{P} \subset \mathcal{H}$
which are not necessarily optimal. More disciplined approaches that aim at finding an
approximate solution to the complexity regularization problem have utilized stepwise selection
methods or genetic algorithms, discussed below.



3.2 Stepwise Selection Methods 

Stepwise methods are commonly used to select the predictor variables in a regression model.
The methods commonly used are forward selection, backward elimination and stepwise
regression. Forward selection starts with no variable in the model and adds variables until
no remaining variable (outside the model) can contribute anything significant to explaining
the dependent variable. Backward elimination is the opposite of forward selection: it begins
with a model which includes all the variables, and variables are deleted one by one until all
remaining variables contribute something significant to explaining the dependent variable.
Stepwise regression is a modification of the forward selection method in which variables once
included in the model are not guaranteed to stay. A detailed discussion on these approaches
and their weaknesses can be found in a recent study by Ratner (2010).



3.3 Genetic algorithms and other heuristics 

A number of studies use genetic algorithms (GA) and other heuristic algorithms to choose
regressors in a regression problem. Some of the studies known to the authors are
Paterlini and Minerva (2010); Broadhurst et al. (1997); Gilli and Winker (2009). However,
they differ from the method proposed in this paper, as they assume a single objective function
(usually an information criterion) and then use the heuristic algorithm to find an optimal
regression model which optimizes the chosen objective. For example, one such algorithm is
the Parallel Genetic Algorithm (PGA) framework suggested by Zhu and Chipman (2006). A
recent heuristic by Wolters et al. (2011) proposes a non-convergent approach for generating
a large number of models for a fixed model size. Thereafter, a feature extraction problem
is solved to choose the most appropriate model. This study differs from ours, as we target
the entire set of Pareto-optimal models in a single run of the algorithm. In the process of
converging towards the best-fit models, we also get a high number of dominated models close
to the Pareto frontier as a by-product of our optimization scheme.



3.4 Best Subsets Method 



In the best subsets method, usually an exhaustive or a branch and bound algorithm (Furnival and
Wilson (1974)) is used to find the best models corresponding to a fixed number of variables. The
best subset selection finds the model with the greatest goodness-of-fit for a fixed number of
variables. When repeated for different numbers of variables, this procedure yields a set of
efficient solutions similar to what we are aiming for. The algorithm to find the best subsets
becomes computationally very expensive with an increasing number of variables and is not a
viable technique when the number of variables is very high.
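The exhaustive variant can be sketched in a few lines. The example below is only an illustration under stated assumptions: it fits each subset by ordinary least squares without an intercept (matching the hypothesis space of Section 2.1), solves the normal equations with a small hand-written elimination routine, and uses simulated data. The names `subset_mse` and `best_subsets` are hypothetical, and the branch and bound algorithm of Furnival and Wilson (1974) is not reproduced here.

```python
import random
from itertools import combinations


def subset_mse(X, y, cols):
    """MSE of a least-squares fit of y on the columns `cols` of X (no intercept)."""
    A = [[row[c] for c in cols] for row in X]
    n, m = len(y), len(cols)
    # Augmented normal equations [A^T A | A^T y], solved by Gauss-Jordan elimination.
    M = [[sum(A[i][r] * A[i][c] for i in range(n)) for c in range(m)]
         + [sum(A[i][r] * y[i] for i in range(n))] for r in range(m)]
    for c in range(m):
        p = max(range(c, m), key=lambda r: abs(M[r][c]))  # partial pivoting
        M[c], M[p] = M[p], M[c]
        for r in range(m):
            if r != c:
                t = M[r][c] / M[c][c]
                M[r] = [u - t * v for u, v in zip(M[r], M[c])]
    beta = [M[r][m] / M[r][r] for r in range(m)]
    return sum((y[i] - sum(b * a for b, a in zip(beta, A[i]))) ** 2 for i in range(n)) / n


def best_subsets(X, y):
    """For every model size d, return (d, best subset, its MSE) by full enumeration."""
    p = len(X[0])
    best = []
    for d in range(p + 1):
        cols = min(combinations(range(p), d), key=lambda c: subset_mse(X, y, c))
        best.append((d, cols, subset_mse(X, y, cols)))
    return best


# Simulated data (an assumption): y depends on variables 0 and 2 only.
rng = random.Random(0)
X = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(40)]
y = [2.0 * r[0] - 3.0 * r[2] + rng.gauss(0, 0.1) for r in X]
results = best_subsets(X, y)
for d, cols, mse in results:
    print(d, cols, round(mse, 4))
```

With p variables the enumeration visits all 2^p subsets, which is why this approach stops being practical for large p. The list it returns still needs a dominance filter if a larger model does not strictly improve the fit.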



3.5 Bayesian Model Averaging 

An alternative to frequentist approaches for model selection is the use of techniques
developed for Bayesian model averaging (BMA) (Leamer (1978); Madigan and Raftery (1994);
Chatfield (1995); Hoeting et al. (1999); Montgomery and Nyhan (2010); Clyde et al. (2010)),
which can also be used as a technique for model selection. In our experiments, we consider one
such method, which uses BMA to rank models together with the all subsets method. The BMA
technique computes the full joint posterior distribution over models, which allows the
incorporation of model uncertainty in posterior inferences. The posterior distribution over
models is given by



$$p(f \mid Y) \propto p(Y \mid f)\, p(f) \qquad (4)$$



where $p(Y \mid f) = \int p(Y \mid \theta_f, f)\, p(\theta_f \mid f)\, d\theta_f$ is proportional to the marginal likelihood of $f$ and
$\theta_f$ is the parameter vector for model $f$. Commonly the posterior is constructed under the
assumption of normality in the regression model. The quality of the models can then be
compared in terms of their posterior probabilities. For instance, when searching for an
optimal model, a common strategy in BMA is to select the highest posterior probability
model. As discussed by Clyde et al. (2010), there are several other strategies to perform
optimal model selection, e.g. based on maximization of posterior expected utility. However,
the difficulty in BMA is that when a large number of variables is involved, enumeration of the
models in the hypothesis space $\mathcal{H}$ becomes a heavy task. Therefore, the use of Markov Chain
Monte Carlo techniques or adaptive sampling is necessary even for problems of moderate
size. The Bayesian model averaging technique could also be used with our algorithm for selecting
the best model from the non-dominated set of models.



3.6 Central differences and motivation for MOGA 

Both classical and multi-objective approaches have their pros and cons. The classical scheme
is optimal if the chosen penalty scheme is a good representation of the user's preferences for
the trade-off between empirical risk and model complexity. However, the model selection often
turns out to be quite sensitive to the choice of complexity penalty. Furthermore,
as discussed by Montgomery and Nyhan (2010), uncertainty about the correct model
specification can be very high in practical applications. For instance, in social sciences such
as political research, where large sets of control variables are involved, an attempt to find a
single best model is often poorly justified.

The multi-objective framework proposed in the present paper differs from the classical 
model selection techniques in the following respects: 

(i) Multiple optimal solutions: By treating the model selection task as a multi-objective
optimization problem, we are always looking for a collection of Pareto-optimal solutions
instead of attempting to choose one single optimal point directly. The Pareto set
contains the best solution in terms of goodness-of-fit for each complexity. Therefore,
this set of optimal solutions guarantees that, for a given number of variables, there
cannot exist a model which provides a better fit for the training data.



(ii) Separation of concerns: The purpose of the multi-objective approach is to avoid making
an a priori choice of a complexity penalty. To accomplish this, a distinction is made
between stages which can be objectively decided and those which are more dependent
on the user's preferences and the particular application at hand. Finding the
Pareto-optimal frontier is an optimization problem that can be solved without any a priori
assumptions, whereas the choice of the preferred point(s) from the Pareto-optimal set is
a question that depends on both the preferences and the application. Therefore, in the
proposed approach, the optimization stage and the decision-making stage are treated
separately. By doing so, the multi-objective technique enhances the understanding of the
trade-off and of what separates the alternative predictors.

The remaining question is how to find the Pareto-optimal solutions. Of course, for a finite
search space $\mathcal{H}$, it is always possible to use brute force to find the Pareto-optimal set.
However, such a naive approach would be intractable in practice. To solve the optimization
problem in an efficient manner, our approach introduces a specialized multi-objective
optimization framework that is based on evolutionary computation.

4 The MOGA-VS Framework 

In this section, we discuss the Multi-objective Genetic Algorithm for Variable Selection
(MOGA-VS). The framework of this algorithm has been inspired by some of the existing
evolutionary multi-objective (EMO) procedures (Deb et al. (2002); Zitzler et al. (2001)).
The presented algorithm has been specialized to handle the variable selection Problem 2.2.1
efficiently. This section first provides a step-by-step procedure for the proposed
algorithm (MOGA-VS). Then, the techniques used for visualizing the Pareto-optimal frontier
and the selection criteria are discussed.

4.1 Step-by-Step Procedure for MOGA-VS 

Using the basic genetic algorithm framework, we suggest a specialized algorithm for
producing the efficient set of regression models when one objective is the minimization of the
number of variables and the other objective is the minimization of the mean squared error
(another empirical risk measure may also be used). It should be noted that whenever we refer to
a population member, we are referring to a regression model. Each member (regression model) is
represented by a binary string whose length equals the maximum number of variables. If a
particular variable is present, the bit value is 1; otherwise the bit value is 0. For example, if
the maximum number of variables is $K$, with variables $(x_1, x_2, \dots, x_K)$, then the string
$(1, 0, 0, 1, \dots, 1)_K$ represents a regression model where the first variable is present, the
second is absent, the third is absent, the fourth is present and so on. The sum of the bits in
the string (the number of variables present in the model) represents the first objective and the
mean squared error of the regression model represents the second objective.
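The encoding can be illustrated with a short sketch. The helper names `decode` and `encode` are hypothetical, and variable indices are 0-based here, so index 0 corresponds to the first variable $x_1$.

```python
def decode(bits):
    """Indices (0-based) of the variables switched on in a binary string."""
    return [k for k, b in enumerate(bits) if b == 1]


def encode(cols, K):
    """Binary string of length K with ones at the positions listed in `cols`."""
    return tuple(1 if k in cols else 0 for k in range(K))


model = (1, 0, 0, 1, 1)               # x1, x4 and x5 present; x2 and x3 absent
print(decode(model))                  # [0, 3, 4]
print(sum(model))                     # first objective: 3 variables
print(encode([0, 3, 4], 5) == model)  # True
```

Storing members as tuples makes them hashable, so duplicate models can be detected and previously computed errors can be cached.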

A step-by-step procedure for the Multi-objective Genetic Algorithm for Variable Selection 
(MOGA-VS) is described as follows: 

1. Initialize a parent population, $\mathcal{P}$, of size $N$ by randomly picking the regression variables
for each of the members.

2. Find the non-dominated set of solutions in the population, i.e. $\mathcal{P}^*$.³

3. Pick any member from the non-dominated set $\mathcal{P}^*$ and another member randomly
from $\mathcal{P}$, and perform a single point crossover (Goldberg (1989)) of the binary strings,
leading to two offspring. Repeat the process with different parents until $\lambda$ offspring
members are produced. Add the offspring members to the set $O$.

4. Perform a binary mutation (Goldberg (1989)) on each of the offspring members in the set
$O$ by flipping the bits with a particular probability.

6. Add all the offspring members from the set $O$ to $\mathcal{P}$. The size of $\mathcal{P}$ now exceeds $N$; therefore,
delete dominated members with the highest number of variables until the size of $\mathcal{P}$ becomes
equal to $N$. In case all the members are non-dominated, delete the members with the
highest number of variables.

7. If the specified number of iterations, $i$, has been completed, terminate the process and
report the non-dominated set from $\mathcal{P}$; otherwise go to step 3.

Choosing non-dominated parents for crossover helps the algorithm in exploring members
which are closer to the Pareto-optimal front. The output of the above algorithm is a
non-dominated set of regression models $\mathcal{P}^*$, which provides an approximation of the
Pareto-optimal frontier $\mathcal{H}^*$ of the entire hypothesis space. Once these models are available, the
Pareto-optimal frontier needs to be explored to find the most preferred points. This can be
done using a combination of graphics (Section 4.2) and simple selection metrics (Section 4.3).
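The procedure can be sketched compactly as follows. This is an illustrative reading of the steps rather than the authors' implementation, and it rests on several assumptions that are not stated in the text: models are fitted by least squares without an intercept, duplicate members are dropped when offspring are merged, the non-dominated set is recomputed in every iteration, and the parameter defaults and simulated data are arbitrary. All function names are hypothetical.

```python
import random


def subset_mse(X, y, cols):
    """MSE of a least-squares fit of y on the columns `cols` of X (no intercept)."""
    A = [[row[c] for c in cols] for row in X]
    n, m = len(y), len(cols)
    M = [[sum(A[i][r] * A[i][c] for i in range(n)) for c in range(m)]
         + [sum(A[i][r] * y[i] for i in range(n))] for r in range(m)]
    for c in range(m):                                    # Gauss-Jordan elimination
        p = max(range(c, m), key=lambda r: abs(M[r][c]))  # partial pivoting
        M[c], M[p] = M[p], M[c]
        for r in range(m):
            if r != c:
                t = M[r][c] / M[c][c]
                M[r] = [u - t * v for u, v in zip(M[r], M[c])]
    beta = [M[r][m] / M[r][r] for r in range(m)]
    return sum((y[i] - sum(b * a for b, a in zip(beta, A[i]))) ** 2 for i in range(n)) / n


def dominates(a, b):
    return all(u <= v for u, v in zip(a, b)) and any(u < v for u, v in zip(a, b))


def moga_vs(X, y, pop_size=20, n_offspring=20, iters=30, p_mut=0.05, seed=0):
    rng = random.Random(seed)
    K = len(X[0])
    cache = {}

    def obj(bits):  # (number of variables, MSE) of a binary string, memoized
        if bits not in cache:
            cols = [k for k, b in enumerate(bits) if b]
            cache[bits] = (sum(bits), subset_mse(X, y, cols))
        return cache[bits]

    def front(pop):  # pairwise comparison of all members
        return [f for f in pop if not any(dominates(obj(g), obj(f)) for g in pop)]

    pop = [tuple(rng.randint(0, 1) for _ in range(K)) for _ in range(pop_size)]  # step 1
    for _ in range(iters):
        nd = front(pop)                                                          # step 2
        offspring = []
        while len(offspring) < n_offspring:                                      # step 3
            a, b = rng.choice(nd), rng.choice(pop)
            cut = rng.randint(1, K - 1)              # single point crossover
            offspring += [a[:cut] + b[cut:], b[:cut] + a[cut:]]
        offspring = [tuple(bit ^ (rng.random() < p_mut) for bit in child)        # step 4
                     for child in offspring]
        pop = list(dict.fromkeys(pop + offspring))   # step 6: merge, drop duplicates
        nd = front(pop)
        nd_set = set(nd)
        rest = [f for f in pop if f not in nd_set]
        # Non-dominated members first, then dominated ones; smaller models first.
        pop = (sorted(nd, key=sum) + sorted(rest, key=sum))[:pop_size]
    return sorted(obj(f) + (f,) for f in front(pop))                             # step 7


# Simulated data (an assumption): y depends on variables 0, 2 and 5 only.
rng = random.Random(1)
X = [[rng.gauss(0, 1) for _ in range(6)] for _ in range(60)]
y = [2.0 * r[0] - 3.0 * r[2] + 1.5 * r[5] + rng.gauss(0, 0.1) for r in X]
result = moga_vs(X, y)
for k, mse, bits in result:
    print(k, round(mse, 4), bits)
```

Each row of the output is one member of the reported non-dominated set: its number of variables, its mean squared error and its binary string. Truncating the sorted list removes dominated members with the most variables first and touches non-dominated members only if the population still exceeds its size limit.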

4.2 Visualizing the Pareto-optimal frontier 

In order to get a quick overview of the obtained solutions, a commonly applied strategy is 
to construct an illustration of the Pareto-optimal set in the objective space. In MOGA-VS 
framework two types of graphs are considered: 

(i) Objective Space (OS)-plot: The Pareto-optimal frontier obtained as a solution to
Problem 2.2.1 is plotted so that the empirical risk $\varphi_2$ of the efficient models is
presented as a decreasing function of model complexity $\varphi_1$, i.e.
$\{(\varphi_1(f), \varphi_2(f)) \in \mathbb{N} \times \mathbb{R} : f \in \mathcal{H}^*\}$.
The plot can be used for analysing the trade-off between empirical risk and complexity
before choosing one of the models (see Section 4.3).

(ii) Hypothesis Space (HS)-plot: To get an idea of the structure of the Pareto-optimal
models, i.e. which variables and how many are contained in them, a quick remedy is
to consider an HS-plot, which is reminiscent of a Gantt chart in the hypothesis space.
In the HS-plot, the y-axis shows the variables contained in the Pareto-optimal models,
and the x-axis shows the optimal models ordered according to their complexity, i.e. if
the input space X has p variables, the HS-plot corresponds to the set
$\{(\varphi_1(f), x_{k,f}) : k \in \{1, \ldots, p\}, f \in \mathcal{H}^*\}$, where
$x_{k,f} \in \{0, 1\}$ is an indicator for whether f has the k-th variable or not.
Green colour in the chart indicates the presence of a variable.
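
The indicator structure behind the HS-plot can be sketched in a few lines of Python. The sketch assumes that each efficient model is given as a set of variable names; `hs_plot_rows` and `render` are hypothetical helper names, and a text grid stands in for the coloured chart.

```python
def hs_plot_rows(models, variables):
    """Return (variable, indicators) rows with models ordered by complexity."""
    ordered = sorted(models, key=len)  # x-axis: increasing number of variables
    used = [v for v in variables if any(v in m for m in ordered)]
    return [(v, [1 if v in m else 0 for m in ordered]) for v in used]

def render(rows):
    """Draw the indicator matrix as text: '#' = present, '.' = absent."""
    width = max(len(name) for name, _ in rows)
    lines = []
    for name, bits in rows:
        cells = " ".join("#" if b else "." for b in bits)
        lines.append(f"{name:<{width}}  {cells}")
    return "\n".join(lines)

if __name__ == "__main__":
    toy_models = [{"a", "b"}, {"a"}, {"a", "b", "c"}]
    print(render(hs_plot_rows(toy_models, ["a", "b", "c", "d"])))
```

Variables that appear in no efficient model are dropped from the rows, which keeps the chart readable when the number of candidate variables is large.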



Footnote: Non-dominated members of a particular set can be identified by performing pairwise
comparisons between all the members and selecting the ones which are not dominated by any member.
Illustrations of the graph-based tools and their use are discussed in the light of experimental
studies in Section 5.

4.3 Selecting efficient models 

The graphical representations of the Pareto-optimal frontier can be used in conjunction with 
other criteria to decide which of the optimal models to choose for further examination. Some 
of these strategies are discussed below. 



(i) Knee-point strategy: Observing a knee-point (Bechikh et al. (2010); Das (1999);
Branke et al. (2004); Deb (2001)) in the OS-plot can be considered as an indicator of an
optimal degree of model complexity. A "knee" is interpreted as a saturation point in terms of
goodness-of-fit vs complexity, where a further increase in model complexity yields only
a minor improvement in fit. As demonstrated in Section 5, this strategy appears to work
quite well in many problems despite its simplicity.

(ii) Bayesian statistics: Another strategy is to apply the Bayesian Model Averaging
approach along the Pareto-optimal frontier only. This would allow the user to
select more than one optimal model to perform statistical inference. For example, if
$B^* \subset \mathcal{H}^*$ is a neighborhood of models surrounding the knee-point of the
optimal frontier, the user might want to combine several models to perform posterior
inferences on a given quantity of interest $\Delta$, i.e.
$p(\Delta \mid Y) = \sum_{f \in B^*} p(\Delta \mid f, Y)\, p(f \mid Y)$. This is an
appropriate strategy in particular when the user has prior information.

(iii) Information criteria: The efficient frontier can also be explored using the various
information criteria discussed in Section 3. Applying different information criteria to these
optimal models shows the user which of the criteria agree with each other
and which do not.

(iv) F-tests: In case the user finds that several of the Pareto-optimal models are worthy
candidates for further evaluation, then non-nested F-tests or encompassing F-tests
between the competing specifications can be considered. More details on non-nested
testing can be found e.g. in Davidson and MacKinnon (2004).
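
The knee-point strategy relies on visual inspection of the OS-plot. As a rough numerical aid, one common heuristic (an assumption here, not a rule stated in the text) rescales both objectives to [0, 1] and picks the frontier point farthest from the straight line that joins the two extreme models. A minimal Python sketch with the hypothetical function name `knee_point`:

```python
from math import hypot

def knee_point(sizes, errors):
    """Return the model size whose point lies farthest from the chord
    between the smallest and the largest model on the frontier."""
    points = sorted(zip(sizes, errors))
    if len(points) < 3:
        raise ValueError("need at least three frontier points")
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    x_span = xs[-1] - xs[0]
    y_span = max(ys) - min(ys)
    if x_span == 0 or y_span == 0:
        raise ValueError("frontier is degenerate")
    # Rescale so that neither objective dominates the distance.
    nx = [(x - xs[0]) / x_span for x in xs]
    ny = [(y - min(ys)) / y_span for y in ys]
    # Line through the first and last point: a*x - b*y + c = 0.
    a = ny[-1] - ny[0]
    b = nx[-1] - nx[0]
    c = nx[-1] * ny[0] - ny[-1] * nx[0]
    norm = hypot(a, b)
    distances = [abs(a * x - b * y + c) / norm for x, y in zip(nx, ny)]
    best = max(range(len(points)), key=distances.__getitem__)
    return xs[best]
```

The result depends on which part of the frontier is passed in, so the heuristic is best treated as a starting point for the visual analysis rather than a replacement for it.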



5 Results 

We provide the results on two different datasets in this section. The evaluation of MOGA-VS
is first performed on a simulated dataset for which the true model is known. Thereafter, the
procedure is evaluated on a recently published communities and crimes dataset from the
United States. The purpose is to find the attributes that best explain the total number of
violent crimes per 100K population. The section provides a comparison of the MOGA-VS
framework with other state-of-the-art techniques.



*http://archive.ics.uci.edu/ml/datasets/Communities+and+Crime
Figure 3: The figure shows a part of the MOGA-VS frontier and a part of the Lasso frontier
obtained using the simulated dataset from a sample run. The true model is also plotted.

Figure 4: The figure shows the models obtained from 10 different runs of MOGA-VS and Lasso
for the simulated dataset. The true model is also plotted.



5.1 Simulated Example 

We provide the results obtained from a simulated example with 100 variables and 500 obser- 
vations. To increase the difficulty of the problem we have made all the 100 variables highly 
correlated by using the following mechanism: 



$$X_j = 2Z + \varepsilon_j, \qquad j = 1, 2, \ldots, 100, \qquad \varepsilon_j, Z \sim N_{500}(0, I). \tag{5}$$



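Each variable has variance 4 + 1 = 5, and any two variables share the common component 2Z
with variance 4, so this induces a pairwise correlation of 4/5 = 0.80 among all the variables.
The response variable is then constructed as follows: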



$$Y = 0.1X_1 + 0.2X_2 + 0.3X_3 + \ldots + 1.0X_{10} + \epsilon, \qquad \epsilon \sim N_{500}(0, \sigma^2 I), \qquad \sigma = 1. \tag{6}$$
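
The data-generating mechanism of equations (5) and (6) can be sketched in plain Python. The function names, the default seed and the use of the standard `random` module are assumptions made for illustration; the original experiments were not necessarily generated this way.

```python
import random

def simulate(n_obs=500, n_vars=100, n_active=10, sigma=1.0, seed=0):
    """Draw X_j = 2Z + eps_j and Y = 0.1*X_1 + ... + 1.0*X_10 + eps."""
    rng = random.Random(seed)
    z = [rng.gauss(0.0, 1.0) for _ in range(n_obs)]  # shared component Z
    X = [[2.0 * z_i + rng.gauss(0.0, 1.0) for _ in range(n_vars)] for z_i in z]
    beta = [0.1 * (j + 1) for j in range(n_active)]  # 0.1, 0.2, ..., 1.0
    # zip stops after the first n_active columns, the only ones with a coefficient.
    y = [sum(b * x for b, x in zip(beta, row)) + rng.gauss(0.0, sigma) for row in X]
    return X, y

def correlation(u, v):
    """Sample Pearson correlation of two equally long sequences."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    var_u = sum((a - mean_u) ** 2 for a in u)
    var_v = sum((b - mean_v) ** 2 for b in v)
    return cov / (var_u * var_v) ** 0.5

if __name__ == "__main__":
    X, y = simulate()
    first, second = [r[0] for r in X], [r[1] for r in X]
    print(round(correlation(first, second), 2))  # should be close to 0.80
```

Because only the first ten columns enter the response while all hundred columns are strongly correlated, a selection method can easily pick a correlated stand-in instead of a true predictor, which is what makes the example hard.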



Once the response and predictor variables are generated, they are fed into the MOGA-VS
algorithm. The algorithm is executed for 500 generations and produces a Pareto-frontier of
efficient models with complexities varying from 1 to 100. A part of the frontier produced by
the algorithm for one dataset is shown in Figure 3. We have performed a comparative study,
where we examine the performance of our method against the Lasso (Tibshirani (1996))
scheme. The Lasso frontier is generated by solving a number of single-objective optimization
problems with different parameter values. Figure 3 also shows the frontier obtained from
the Lasso scheme using the same dataset. It can be observed that the models generated
using Lasso are far away from the MOGA-VS frontier. We have also performed a simulation
study where we execute each of the methods on 10 different datasets to observe the precision
and accuracy. Figure 4 shows the results obtained from 10 sample runs of both the methods.



^The Lasso parameter was incremented from 0 in steps of 0.01, and a single-objective optimization
problem was solved for each parameter value until a model was obtained which includes all the variables.
It is easy to observe the better performance of MOGA-VS, both in terms of accuracy and
precision, as compared to the Lasso scheme. Most of the models produced by Lasso are far
away from the frontier, whereas the MOGA-VS frontier always passes close to
the true model. We also executed stepwise regression on the simulated data; the
average number of predictor variables chosen by the method over the 10 different datasets
is 14.70, whereas the true model has 10 variables.

The results produced by MOGA-VS on this simulated example demonstrate its superiority
over other state-of-the-art schemes in helping the decision maker choose an appropriate
model. Even in simple simulated problems, it is not wise to rely entirely on variable
selection schemes like stepwise regression methods or some information criteria. Methods
like Lasso are capable of generating a frontier of solutions; however, as we observe from the
results, the models are not necessarily even close to efficient. Best subset methods might
be an alternative for producing efficient models; however, with a large number of variables it
is not feasible to evaluate all the possible models before deciding on the efficient ones. We
provide a more detailed discussion and comparison results on the communities and crime
example in the next sub-section.

5.2 Communities and crime 



The communities and crimes dataset (Redmond (2009)) is formed as a combination of the
socio-economic and law enforcement data from the 1990 US Census. The data also includes
crime statistics from the 1995 FBI Uniform Crime Report. As discussed by Redmond and Baveja
(2002), the data set was originally collected to create a data-driven software tool called Crime
Similarity System (CSS) for enabling cooperative information sharing among police departments.
The idea in CSS is to utilize a variety of context variables ranging from socioeconomic,
crime and enforcement profiles of cities to generate a list of communities that should be good
candidates to co-operate due to their similar crime profiles.

To demonstrate the performance of the MOGA-VS framework, we consider the data-mining
task of finding variables that best predict how many violent crimes are committed per 100K
people. The number of candidate variables is 122, which corresponds to a hypothesis space
$\mathcal{H}$ of size $2^{122}$. All of the variables have been normalized into the [0, 1]
interval to put all data on the same relative scale. The number of observations (or cases) is
1994, and each observation represents a single city or community. According to Redmond and
Baveja (2002), the variables have been chosen in close co-operation with police departments
to find a collection of factors that provide a good coverage of the different aspects related
to the community environment, law enforcement and crime. However, some of the variables
included in the data set could not be used directly "as is" due to the large number of missing
values. To alleviate this, an imputation technique was used to replace missing values on 20
attributes.

The MOGA-VS algorithm used the following parameter values: population size N = 122,
maximum number of iterations i = 500, crossover probability $p_c$ = 0.9, mutation
probability $p_m$ = 1/122, and number of offspring λ = N.



^The imputation was performed using the method imputeData (Schafer (1997)) in the mclust library
in R.
5.2.1 Analysing the Pareto-optimal frontier

Figure 5: Efficient regression models: Mean squared error of the models has been shown on
the y-axis and number of coefficients on the x-axis.

A description of the Pareto-optimal frontier is provided in Figure 5. The plot shows
the progress of the MOGA-VS algorithm when all the 1994 observations are considered. In
addition to the final frontier, snapshots of intermediate generations are shown to illustrate
the convergence towards the optimal front. The algorithm is able to provide a good
approximation of the Pareto-frontier already by the 100th generation. However, more generations
are needed to ensure convergence to the true frontier. The final result is the set of
non-dominated solutions obtained by the algorithm after executing it for 500 generations. The
plot for generation 1 denotes the initial random models. It can be seen from the graph that
the initial random models lie in the region close to 61 variables. The reason for
this is that each initial bit is chosen to be either 0 or 1 with a 50% probability. Therefore,
the 122-bit chromosome has an expected 61 variables. The algorithm is implemented
in MATLAB, and required a total execution time of 37.27 minutes on a Linux machine with a
2.5 GHz Intel dual core processor and 2 GB of RAM. A total of 61,000 regression models
(122 offspring in each of the 500 generations) were estimated to arrive at the final frontier.

Most of the time we are interested in parsimonious models, so initializing the population
with fewer 1s would speed up the convergence. The convergence can be further enhanced by
specifying constraints in the algorithm to perform a restricted search and produce only
those models that contain between i and j variables. Given a variable collection with 122
candidates, we would hardly want the final models to contain more than 20 variables. Therefore, a
faster approximation of the interesting region of the Pareto-optimal frontier can be obtained
by restricting the search to models with size between 1 and 20. Instead of starting the MOGA-VS
algorithm with a random population, it is also possible to start the algorithm with solutions
close to the Pareto-optimal ones as the initial population. One of the stepwise selection
techniques could be first executed on the dataset to get the trajectory of the stepwise approach.
Thereafter, the trajectory models could be used to generate the starting population for the
MOGA-VS approach. Trajectory models could be a much better starting guess as compared with a
random population. However, in this paper we do not use any starting guesses, in order to show
that MOGA-VS alone can lead to a diverse set of Pareto-optimal solutions.
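
The idea of a sparse, size-restricted initial population can be sketched as follows. This variant was not used for the results reported here; the function name `init_population`, the repair rule and the parameter names are assumptions made for illustration.

```python
import random

def init_population(pop_size, n_vars, p_one, min_size, max_size, rng):
    """Random bit strings whose number of 1s is forced into [min_size, max_size]."""
    if not 0 <= min_size <= max_size <= n_vars:
        raise ValueError("need 0 <= min_size <= max_size <= n_vars")
    population = []
    for _ in range(pop_size):
        # Each bit is 1 with probability p_one, so about n_vars * p_one bits are set.
        bits = [1 if rng.random() < p_one else 0 for _ in range(n_vars)]
        ones = [i for i, b in enumerate(bits) if b]
        zeros = [i for i, b in enumerate(bits) if not b]
        if len(ones) > max_size:
            # Repair: switch off randomly chosen surplus variables.
            for i in rng.sample(ones, len(ones) - max_size):
                bits[i] = 0
        elif len(ones) < min_size:
            # Repair: switch on randomly chosen missing variables.
            for i in rng.sample(zeros, min_size - len(ones)):
                bits[i] = 1
        population.append(bits)
    return population

if __name__ == "__main__":
    pop = init_population(122, 122, 10 / 122, 1, 20, random.Random(0))
    print(min(sum(c) for c in pop), max(sum(c) for c in pop))
```

With `p_one = 0.5` and no size restriction the same function gives the uniform initialization described above, whose chromosomes contain 61 variables on average.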

Based on visualization of the frontiers, we find that the knee of the curve lies in the region
of 5 to 15 variables. The models which explain most of the variation in the response variable
are the ones in the knee region. The incremental contributions of the remaining combinations
of 112 variables are relatively small. This means that incorporating more explanatory
variables would lead to only minor additional explanation of the variation. Choosing one of
the models from the knee region offers a good compromise between goodness of fit and
complexity. In Table 1 we provide the HS-plot, which shows all non-dominated models with
5 to 15 variables produced by the MOGA-VS algorithm. The variables which are
present in the model are marked as 1 and the others are marked as 0. This chart shows
which variable(s) enter the model and which variable(s) are eliminated from it when the size
of the model is increased by 1. Consider a scenario in which
the model size is increased from k to k + 1, causing one variable to leave the model and two
variables to enter the model. This suggests that the explanatory power of the two variables
entering the model is greater than the explanatory power of the variable leaving the model
when the remaining k − 1 variables are kept intact. The chart helps users build insight
into the problem and improves their understanding, so that they can choose a regression model
wisely. After having the background information provided by the MOGA-VS algorithm, one
can proceed to use a strategy for model selection. In the next sub-section we discuss the
results obtained by other variable selection strategies.
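
The entering and leaving variables between two consecutive frontier models can be read off with a simple set difference. The sketch below uses the 9-variable and 10-variable models as read from Table 1; the function name `model_changes` is a hypothetical helper.

```python
def model_changes(smaller, larger):
    """Return (entering, leaving) variables when moving from one model to the next."""
    entering = sorted(set(larger) - set(smaller))
    leaving = sorted(set(smaller) - set(larger))
    return entering, leaving

# Variables shared by all knee-region models in Table 1.
base = {"racepctblack", "PctIlleg", "PctPersDenseHous", "HousVacant", "MalePctDivorce"}
model_9 = base | {"pctWWage", "pctUrban", "NumStreet", "numbUrban"}
model_10 = base | {"pctWWage", "pctUrban", "NumStreet", "RentLowQ", "MedRent"}

entering, leaving = model_changes(model_9, model_10)
print("entering:", entering)
print("leaving:", leaving)
```

Here one variable leaves and two enter while the other eight stay in place, which is exactly the scenario discussed above for a step from k to k + 1 variables.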

5.2.2 Results obtained from other techniques 

In this section, we present the results from other state-of-the-art techniques used for variable
selection. Figure 6 shows the frontier obtained using MOGA-VS against the frontier produced
by the Lasso scheme of Tibshirani (1996). The Lasso frontier is obtained by solving
single-objective optimization problems with different parameter values. Along with the two
frontiers, the figure also shows the trajectory for a stepwise regression scheme, which is found
to be close to the frontiers. The model shown with a cross is the final model chosen by the
stepwise regression method. The initial points for the Lasso and Stepwise methods are not visible
in the figure as they have a high MSE value. Figure 7 shows the models obtained using the
BMA approach for two different parameter values of the leaps algorithm, i.e. nbest = 1
and nbest = 10. The parameter nbest represents the number of models for each variable
size to be generated by the leaps algorithm. The results shown in the first two figures are
obtained by utilizing the entire data-set for training. In both figures, we observe that the
models produced by Lasso, Stepwise and BMA are dominated by the MOGA-VS frontier.



^The Lasso parameter was incremented from 0 in steps of 0.01, and a single-objective optimization
problem was solved for each parameter value until a model was obtained which includes all the variables.
*http://cran.r-project.org/web/packages/BMA/BMA.pdf 
Table 1: Models in the knee region of the Pareto-optimal frontier.

racepctblack             1     1     1     1     1     1     1     1     1     1     1
PctIlleg                 1     1     1     1     1     1     1     1     1     1     1
PctPersDenseHous         1     1     1     1     1     1     1     1     1     1     1
HousVacant               1     1     1     1     1     1     1     1     1     1     1
MalePctDivorce           1     1     1     1     1     1     1     1     1     1     1
pctWWage                 0     1     1     1     1     1     1     1     1     0     0
pctUrban                 0     0     1     1     1     1     1     1     1     1     1
NumStreet                0     0     0     1     1     1     1     1     1     1     1
numbUrban                0     0     0     0     1     0     0     1     1     1     1
RentLowQ                 0     0     0     0     0     1     1     1     1     1     1
MedRent                  0     0     0     0     0     1     1     1     1     1     1
MedOwnCost...Mtg         0     0     0     0     0     0     1     1     1     1     1
PctWorkMom               0     0     0     0     0     0     0     0     1     1     1
pctWSocSec               0     0     0     0     0     0     0     0     0     1     1
PctKids2Par              0     0     0     0     0     0     0     0     0     1     1
LemasSwFTFieldOps        0     0     0     0     0     0     0     0     0     0     1
No. of Variables         5     6     7     8     9    10    11    12    13    14    15
MSE × 100             2.00  1.92  1.89  1.88  1.86  1.85  1.84  1.83  1.82  1.81  1.80


[Plot for Figure 6 not reproduced. x-axis: Number of Variables; y-axis: MSE; legend: MOGA-VS, Stepwise, Lasso.]

[Plot for Figure 7 not reproduced. x-axis: Number of Variables; y-axis: MSE; legend: MOGA-VS, BMA nbest=10, BMA nbest=1.]

Figure 6: The figure shows a part of the MOGA-VS frontier, a part of the Lasso frontier and
the Stepwise trajectory obtained using the entire communities and crime data as training set.

Figure 7: The figure shows a part of the MOGA-VS frontier and the BMA results for two
different parameter values obtained using the entire communities and crime data as training set.



To examine the sensitivity of the model selection techniques to the choice of training and
evaluation data, we proceed with another experiment, where the original data-set is divided
into a training and an evaluation set. To obtain the average results, we create multiple test-sets
of training and evaluation data by randomly choosing 50% of the rows from the original
data-set as training set and the remaining rows as evaluation set. Aggregated results of the
randomization experiment are furnished in Tables 2 and 3, which provide a performance
metric for all the methods across 20 different test sets of training and evaluation data. For
the i-th test-set we generate the frontier using one of the methods, and calculate the average
MSE (say $\kappa_i^{\text{method}}$) for a part of the frontier models. The comparison metric
is computed by taking an average of $\kappa_i^{\text{method}}$ across the 20 test sets (say
$\kappa^{\text{method}}$) for each of the methods. A lower value of $\kappa^{\text{method}}$
denotes a better performance. We conclude from the results that
the best-fit models for the training set perform better even on the evaluation set, but this may
not always be true. The performance metric denotes a slightly better performance for the
MOGA-VS algorithm on the training sets as well as the evaluation sets.
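
The randomization experiment and the comparison metric can be sketched as follows. The sketch assumes that each frontier is stored as a dictionary that maps model size to MSE (on either the training or the evaluation rows); all function names are hypothetical.

```python
import random

def random_split(n_rows, train_fraction, rng):
    """Randomly assign row indices to a training and an evaluation set."""
    indices = list(range(n_rows))
    rng.shuffle(indices)
    n_train = int(round(train_fraction * n_rows))
    return sorted(indices[:n_train]), sorted(indices[n_train:])

def frontier_average(frontier, min_size, max_size):
    """Average MSE over the frontier models with min_size..max_size variables."""
    values = [mse for size, mse in frontier.items() if min_size <= size <= max_size]
    if not values:
        raise ValueError("no frontier model falls inside the size range")
    return sum(values) / len(values)

def comparison_metric(frontiers, min_size, max_size):
    """Average of the per-test-set averages across all test sets."""
    per_split = [frontier_average(f, min_size, max_size) for f in frontiers]
    return sum(per_split) / len(per_split)

if __name__ == "__main__":
    train_rows, eval_rows = random_split(1994, 0.5, random.Random(0))
    print(len(train_rows), len(eval_rows))
```

Restricting the average to a common size range keeps the comparison fair when a method does not produce models along the entire frontier.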

Figures 8 and 9 provide the results on the evaluation data-set for MOGA-VS, Stepwise
Regression, Lasso and BMA for a particular test-set out of the 20 randomly generated test-sets.
The cross-mark in Figure 8 is the model suggested by the stepwise regression method. We
can observe from the graphs that the frontier for MOGA-VS is slightly ahead of the other
frontiers, particularly towards models with a smaller number of variables.

In Table 4 we provide the results for BMA, PGA and stepwise regression methods. A
direct comparison with the MOGA-VS results can be obtained by comparing Table 4 against
Table 1. The table shows the variables which are present in different models along with
coefficient sign patterns. A plus sign indicates a positive coefficient for the variable, a
negative sign indicates a negative coefficient for the variable, and no sign indicates that
the variable is absent in the model. For BMA we have presented the top 5 models ranked
by posterior probabilities. The best model proposed by BMA and the model proposed by
PGA have 14 and 12 variables, respectively. Both of these models lie close to the
knee region of the Pareto-frontier. On the other hand, Stepwise regression proposes a model
with 23 variables, which could be rejected. A closer look at the table shows that the models
proposed by BMA and PGA agree with each other and contain mostly common variables.
If the user wants a less complex model with fewer than 12 variables, then the models in the
knee region of the Pareto-frontier offer relevant alternatives. Finally, we would like to end
the discussion without suggesting one particular model as the best regression model for the
Communities and Crime example, as it is not possible to suggest one best solution in the
presence of trade-offs. It is ultimately the user who needs to choose a compromise solution
which is most suitable for their purposes.

^The reason for considering only a part of the frontier is that not all the methods produce models
across the entire frontier. The Stepwise trajectory contains models with 1 to 25 variables and BMA
contains models with 6 to 20 variables.

Table 2: Values for $\kappa^{\text{MOGAVS}}$, $\kappa^{\text{Lasso}}$ and $\kappa^{\text{Stepwise}}$
computed from 20 frontiers. Models containing 1 to 25 variables were considered while taking the average.

Table 3: Values for $\kappa^{\text{MOGAVS}}$, $\kappa^{\text{BMA(nbest=1)}}$ and
$\kappa^{\text{BMA(nbest=10)}}$ computed from 20 frontiers. Models containing 6 to 20 variables
were considered while taking the average.

Figure 8: The figure shows a part of the MOGA-VS frontier, Lasso frontier, and Stepwise
trajectory on the evaluation set when 50% of the communities and crime data is used as
training set and the remaining 50% as evaluation set.

Figure 9: The figure shows a part of the MOGA-VS frontier, and BMA results for two
different parameters on the evaluation set when 50% of the communities and crime data
is used as training set and the remaining 50% as evaluation set.

Table 4: Experiment Results: Model variables, + denotes a positive coefficient, - denotes a 
negative coefficient and no sign denotes that the variable is absent in the model 
To summarize, most of the existing methods have their own merits and demerits. Lasso
solves a single-objective optimization problem and produces a point close to the frontier.
However, as we see from the results, the models obtained using the Lasso scheme for a small
number of variables might be far away from the Pareto-frontier. Stepwise regression provides
a trajectory which passes close to the frontier, but the final model proposed by the method
may not be the most appropriate one. In the communities and crime example, we observe
that the final model proposed by the stepwise regression method contains a large number of
variables. The solutions produced by the best subsets method could be an alternative,
but the method becomes computationally too expensive for a high number of variables. Under
such uncertainty, we feel that the MOGA-VS approach would be a helpful tool as it
provides the entire set of best-fit models, based on which the choice of the most appropriate
model could be made.

6 Conclusions 

In this paper, we have proposed a Multi-objective Genetic Algorithm for Variable Selection
(MOGA-VS) which can be used for producing the entire set of efficient regression models.
Once the efficient set of models is known, the most preferred model can be chosen by assessing
these models. The proposed algorithm has been tested on a simulated and a real data-set, and
results have been presented. Comparison studies have been performed with state-of-the-art
techniques like Lasso, BMA, Stepwise regression methods and PGA. The results produced by the
MOGA-VS algorithm achieved a better goodness-of-fit on the test cases considered in the paper.
Results obtained using various approaches support the knee region hypothesis. To conclude, the
MOGA-VS algorithm can prove to be a useful tool when there are many predictor variables,
and a choice of a model with acceptable quality of fit and complexity is to be made. The
frontier of solutions produced by the MOGA-VS scheme gives a visual impression of the
entire model selection scheme and helps the user to make decisions efficiently.